Objectives: The goal of this kernel is to find the best approach to identifying wine quality. We will go through basic EDA and visually explore quality via a 3D interactive plot. I also apply multiple ML models to make predictions; each model has its own strength.
Rpart gives a nice decision tree plot so you can see the variables more intuitively.
Random Forest can usually be run directly with a minimal amount of tuning.
xgboost is expected to produce the best result but needs a bit of tuning.
svm is an alternative approach and usually gives a less correlated result.
h2o's deep learning is one of the easiest tools for applying a deep learning model. I could potentially use keras, but given the size and structure of the data, I don't believe a deep learning model would outperform xgboost in this case.
A confusion matrix is used to evaluate the results.
If you have any questions, please leave a comment, and if you like the kernel, please give me an upvote~ Thanks!
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, skimr, GGally, plotly, viridis, caret, randomForest, e1071, rpart, xgboost, h2o, corrplot, rpart.plot)
#pacman::p_load(tidyverse, skimr, GGally, corrplot, plotly, viridis, caret, randomForest, e1071, rpart, rattle, xgboost, h2o)
wine <- read_csv("winequality-red.csv")

skim gives you an overview of the dataset: the number of observations, the number of columns, the range of each variable, the number of missing/unique values, a histogram, etc.
wine %>% skim() %>% kable()

## Skim summary statistics
## n obs: 1599
## n variables: 12
##
## Variable type: integer
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## --------- -------- --------- ----- ----- ----- --- ---- ---- ---- ----- ---------
## quality         0     1599  1599  5.64  0.81  3   5    6    6    8     ▁▁▁▇▇▁▂▁
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## --------------------- -------- --------- ----- ------ ------- ------ ----- ------ ----- ----- ---------
## alcohol                0        1599  1599  10.42   1.07    8.4    9.5   10.2   11.1  14.9  ▂▇▅▃▂▁▁▁
## chlorides              0        1599  1599  0.087   0.047   0.012  0.07  0.079  0.09  0.61  ▇▃▁▁▁▁▁▁
## citric acid            0        1599  1599  0.27    0.19    0      0.09  0.26   0.42  1     ▇▅▅▆▂▁▁▁
## density                0        1599  1599  1       0.0019  0.99   1     1      1     1     ▁▁▃▇▇▂▁▁
## fixed acidity          0        1599  1599  8.32    1.74    4.6    7.1   7.9    9.2   15.9  ▁▇▇▅▂▁▁▁
## free sulfur dioxide    0        1599  1599  15.87   10.46   1      7     14     21    72    ▇▇▅▂▁▁▁▁
## pH                     0        1599  1599  3.31    0.15    2.74   3.21  3.31   3.4   4.01  ▁▁▅▇▅▁▁▁
## residual sugar         0        1599  1599  2.54    1.41    0.9    1.9   2.2    2.6   15.5  ▇▂▁▁▁▁▁▁
## sulphates              0        1599  1599  0.66    0.17    0.33   0.55  0.62   0.73  2     ▂▇▂▁▁▁▁▁
## total sulfur dioxide   0        1599  1599  46.47   32.9    6      22    38     62    289   ▇▅▂▁▁▁▁▁
## volatile acidity       0        1599  1599  0.53    0.18    0.12   0.39  0.52   0.64  1.58  ▂▇▇▃▁▁▁▁
Corrplot gives an overview of the correlations between all the variables. It is good to know these relationships at the very beginning of your analysis.
wine %>% cor() %>% corrplot.mixed(upper = "ellipse", tl.cex=.8, tl.pos = 'lt', number.cex = .8)

Correct the column names and turn the quality variable into a factor.
colnames(wine) <- wine %>% colnames() %>% str_replace_all(" ","_")
wine$quality <- as.factor(wine$quality)

I had a quick look and found that the following variables do not differ significantly across quality levels: residual_sugar, free_sulfur_dioxide, total_sulfur_dioxide, and chlorides. Therefore, these variables are not included in the ggpairs plot. Further, I found that volatile_acidity, sulphates, and alcohol show more significant differences across quality levels, based on the graph below.
The remaining variables show some differences; however, the differences between quality levels are not large enough to determine quality directly from any single variable. In the following EDA, I leverage the power of plotly to produce a 3D interactive graph so we can visually inspect the different quality levels at different alcohol, volatile acidity, and sulphates levels. I expect we will see a trend to some extent but will not be able to identify quality directly from the graph.
wine %>%
mutate(quality = as.factor(quality)) %>%
select(-c(residual_sugar, free_sulfur_dioxide, total_sulfur_dioxide, chlorides)) %>%
ggpairs(aes(color = quality, alpha=0.4),
columns=1:7,
lower=list(continuous="points"),
upper=list(continuous="blank"),
axisLabels="none", switch="both")

The graph is consistent with my expectation. The combination of the three variables helps to some extent, but we are still not able to identify the quality clearly. This calls for machine learning.
I am going to conduct the machine learning part with different models and a validation set to check which model produces the best result.
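As a side note, the model comparison could also be done with k-fold cross-validation instead of a single holdout split. The sketch below uses caret; the fold count and mtry grid are illustrative assumptions, not settings this kernel actually runs:

```r
# Illustrative: 5-fold cross-validated random forest via caret
# (fold count and mtry grid are assumptions, not this kernel's settings)
set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)
cv_rf <- train(quality ~ ., data = wine,
               method = "rf", trControl = ctrl,
               tuneGrid = data.frame(mtry = c(2, 3, 4)))
cv_rf$results  # mean Accuracy/Kappa per mtry, averaged over the folds
```

This gives a resampled estimate of performance for each candidate setting rather than a single-split estimate, at the cost of extra fitting time.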
wine %>%
plot_ly(x=~alcohol,y=~volatile_acidity,z= ~sulphates, color=~quality, hoverinfo = 'text', colors = viridis(3),
text = ~paste('Quality:', quality,
'<br>Alcohol:', alcohol,
'<br>Volatile Acidity:', volatile_acidity,
'<br>sulphates:', sulphates)) %>%
add_markers(opacity = 0.8) %>%
layout(title = "3D Wine Quality",
annotations=list(yref='paper',xref="paper",y=1.05,x=1.1, text="quality",showarrow=F),
scene = list(xaxis = list(title = 'Alcohol'),
yaxis = list(title = 'Volatile Acidity'),
zaxis = list(title = 'sulphates')))

set.seed(1)
inTrain <- createDataPartition(wine$quality, p=.9, list = F)
train <- wine[inTrain,]
valid <- wine[-inTrain,]
rm(inTrain)

The rpart plot shows that alcohol, volatile acidity, and sulphates are important variables for determining quality, which is consistent with the exploratory data analysis.
# rpart
set.seed(1)
rpart_model <- rpart(quality~alcohol+volatile_acidity+citric_acid+
density+pH+sulphates, train)
rpart.plot(rpart_model)
#fancyRpartPlot(rpart_model)
rpart_result <- predict(rpart_model, newdata = valid[,!colnames(valid) %in% c("quality")],type='class')
confusionMatrix(valid$quality,rpart_result)

## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 1 0 0 0
## 4 0 0 2 3 0 0
## 5 0 0 50 18 0 0
## 6 0 0 21 38 4 0
## 7 0 0 1 13 5 0
## 8 0 0 0 1 0 0
##
## Overall Statistics
##
## Accuracy : 0.5924
## 95% CI : (0.5112, 0.67)
## No Information Rate : 0.4777
## P-Value [Acc > NIR] : 0.00257
##
## Kappa : 0.3201
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity NA NA 0.6667 0.5205 0.55556 NA
## Specificity 0.993631 0.96815 0.7805 0.7024 0.90541 0.993631
## Pos Pred Value NA NA 0.7353 0.6032 0.26316 NA
## Neg Pred Value NA NA 0.7191 0.6277 0.97101 NA
## Prevalence 0.000000 0.00000 0.4777 0.4650 0.05732 0.000000
## Detection Rate 0.000000 0.00000 0.3185 0.2420 0.03185 0.000000
## Detection Prevalence 0.006369 0.03185 0.4331 0.4013 0.12102 0.006369
## Balanced Accuracy NA NA 0.7236 0.6115 0.73048 NA
varImp(rpart_model) %>% kable()

| | Overall |
|---|---|
| alcohol | 117.939988 |
| citric_acid | 42.390967 |
| density | 45.500825 |
| pH | 4.487869 |
| sulphates | 85.251621 |
| volatile_acidity | 81.500436 |
rm(rpart_model, rpart_result)

Even without any tuning, random forest produces a much better result than the decision tree model.
# randomforest
set.seed(1)
rf_model <- randomForest(quality~alcohol+volatile_acidity+citric_acid+
density+pH+sulphates,train)
rf_result <- predict(rf_model, newdata = valid[,!colnames(valid) %in% c("quality")])
confusionMatrix(valid$quality,rf_result)

## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 1 0 0
## 4 0 0 4 1 0 0
## 5 0 0 54 14 0 0
## 6 0 0 17 40 6 0
## 7 0 0 1 6 11 1
## 8 0 0 0 0 0 1
##
## Overall Statistics
##
## Accuracy : 0.6752
## 95% CI : (0.5959, 0.7476)
## No Information Rate : 0.4841
## P-Value [Acc > NIR] : 1.028e-06
##
## Kappa : 0.475
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity NA NA 0.7105 0.6452 0.64706 0.500000
## Specificity 0.993631 0.96815 0.8272 0.7579 0.94286 1.000000
## Pos Pred Value NA NA 0.7941 0.6349 0.57895 1.000000
## Neg Pred Value NA NA 0.7528 0.7660 0.95652 0.993590
## Prevalence 0.000000 0.00000 0.4841 0.3949 0.10828 0.012739
## Detection Rate 0.000000 0.00000 0.3439 0.2548 0.07006 0.006369
## Detection Prevalence 0.006369 0.03185 0.4331 0.4013 0.12102 0.006369
## Balanced Accuracy NA NA 0.7688 0.7015 0.79496 0.750000
After reviewing the result, let’s look at which variables contribute the most.
varImp(rf_model) %>% kable()

| | Overall |
|---|---|
| alcohol | 194.0940 |
| volatile_acidity | 158.6520 |
| citric_acid | 127.2520 |
| density | 158.2908 |
| pH | 133.5067 |
| sulphates | 154.0123 |
varImpPlot(rf_model)
rm(rf_model, rf_result)

SVM provides an alternative solution; however, the result is not outstanding.
# svm
set.seed(1)
svm_model <- svm(quality~alcohol+volatile_acidity+citric_acid+
density+pH+sulphates,train)
svm_result <- predict(svm_model, newdata = valid[,!colnames(valid) %in% c("quality")])
confusionMatrix(valid$quality,svm_result)

## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 1 0 0 0
## 4 0 0 3 2 0 0
## 5 0 0 57 11 0 0
## 6 0 0 20 41 2 0
## 7 0 0 1 12 6 0
## 8 0 0 0 1 0 0
##
## Overall Statistics
##
## Accuracy : 0.6624
## 95% CI : (0.5827, 0.7359)
## No Information Rate : 0.5223
## P-Value [Acc > NIR] : 0.0002612
##
## Kappa : 0.4339
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity NA NA 0.6951 0.6119 0.75000 NA
## Specificity 0.993631 0.96815 0.8533 0.7556 0.91275 0.993631
## Pos Pred Value NA NA 0.8382 0.6508 0.31579 NA
## Neg Pred Value NA NA 0.7191 0.7234 0.98551 NA
## Prevalence 0.000000 0.00000 0.5223 0.4268 0.05096 0.000000
## Detection Rate 0.000000 0.00000 0.3631 0.2611 0.03822 0.000000
## Detection Prevalence 0.006369 0.03185 0.4331 0.4013 0.12102 0.006369
## Balanced Accuracy NA NA 0.7742 0.6837 0.83138 NA
rm(svm_model, svm_result)

xgboost with a little bit of hyper-parameter tuning achieves the best result among the models so far.
# xgboost
data.train <- xgb.DMatrix(data = data.matrix(train[, !colnames(valid) %in% c("quality")]), label = train$quality)
data.valid <- xgb.DMatrix(data = data.matrix(valid[, !colnames(valid) %in% c("quality")]))
parameters <- list(
# General Parameters
booster = "gbtree",
silent = 0,
# Booster Parameters
eta = 0.2,
gamma = 0,
max_depth = 5,
min_child_weight = 2,
subsample = 1,
colsample_bytree = 1,
colsample_bylevel = 1,
lambda = 1,
alpha = 0,
# Task Parameters
objective = "multi:softmax", # default = "reg:linear"
eval_metric = "merror",
num_class = 7,
seed = 1 # reproducability seed
)
xgb_model <- xgb.train(parameters, data.train, nrounds = 100)
xgb_pred <- predict(xgb_model, data.valid)
confusionMatrix(as.factor(xgb_pred+2), valid$quality)

## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 0 0 0 0
## 4 0 0 1 1 0 0
## 5 0 4 55 12 2 0
## 6 1 1 12 45 5 0
## 7 0 0 0 4 11 0
## 8 0 0 0 1 1 1
##
## Overall Statistics
##
## Accuracy : 0.7134
## 95% CI : (0.6359, 0.7826)
## No Information Rate : 0.4331
## P-Value [Acc > NIR] : 1.144e-12
##
## Kappa : 0.5399
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.8088 0.7143 0.57895 1.000000
## Specificity 1.000000 0.98684 0.7978 0.7979 0.97101 0.987179
## Pos Pred Value NaN 0.00000 0.7534 0.7031 0.73333 0.333333
## Neg Pred Value 0.993631 0.96774 0.8452 0.8065 0.94366 1.000000
## Prevalence 0.006369 0.03185 0.4331 0.4013 0.12102 0.006369
## Detection Rate 0.000000 0.00000 0.3503 0.2866 0.07006 0.006369
## Detection Prevalence 0.000000 0.01274 0.4650 0.4076 0.09554 0.019108
## Balanced Accuracy 0.500000 0.49342 0.8033 0.7561 0.77498 0.993590
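The booster parameters above were set by hand; a small search with xgb.cv is one way they could be tuned further. The grid values below are illustrative assumptions, not the tuning actually performed in this kernel:

```r
# Illustrative sketch: tune max_depth and eta with 5-fold xgb.cv
# (grid values are assumptions; reuses `parameters` and `data.train` above)
best <- list(merror = Inf)
for (depth in c(4, 5, 6)) {
  for (eta in c(0.1, 0.2, 0.3)) {
    params <- modifyList(parameters, list(max_depth = depth, eta = eta))
    cv <- xgb.cv(params, data.train, nrounds = 100, nfold = 5,
                 verbose = 0, early_stopping_rounds = 10)
    err <- min(cv$evaluation_log$test_merror_mean)
    if (err < best$merror) best <- list(merror = err, depth = depth, eta = eta)
  }
}
best  # lowest cross-validated multi-class error and its settings
```

Early stopping keeps each fit cheap, so even a coarse grid like this is quick to evaluate on a dataset of this size.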
rm(xgb_model, xgb_pred, data.train, data.valid, parameters)

h2o is one of the easiest tools for applying a deep learning model; however, given the size of the dataset, deep learning does not outperform the xgboost model.
# h2o
h2o.init()
h2o.train <- as.h2o(train)
h2o.valid <- as.h2o(valid)
h2o.model <- h2o.deeplearning(x = setdiff(names(train), c("quality")),
y = "quality",
training_frame = h2o.train,
# activation = "RectifierWithDropout", # algorithm
# input_dropout_ratio = 0.2, # % of inputs dropout
# balance_classes = T,
# momentum_stable = 0.99,
# nesterov_accelerated_gradient = T, # use it for speed
epochs = 1000,
standardize = TRUE, # standardize data
hidden = c(100, 100), # 2 layers of 100 nodes each
rate = 0.05, # learning rate
seed = 1 # reproducability seed
)
h2o.predictions <- h2o.predict(h2o.model, h2o.valid) %>% as.data.frame()
confusionMatrix(h2o.predictions$predict, valid$quality)

## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8
## 3 0 0 1 0 0 0
## 4 0 0 2 1 0 0
## 5 0 4 51 14 3 0
## 6 1 1 12 43 3 0
## 7 0 0 2 5 12 0
## 8 0 0 0 0 1 1
##
## Overall Statistics
##
## Accuracy : 0.6815
## 95% CI : (0.6025, 0.7535)
## No Information Rate : 0.4331
## P-Value [Acc > NIR] : 2.987e-10
##
## Kappa : 0.4966
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.7500 0.6825 0.63158 1.000000
## Specificity 0.993590 0.98026 0.7640 0.8191 0.94928 0.993590
## Pos Pred Value 0.000000 0.00000 0.7083 0.7167 0.63158 0.500000
## Neg Pred Value 0.993590 0.96753 0.8000 0.7938 0.94928 1.000000
## Prevalence 0.006369 0.03185 0.4331 0.4013 0.12102 0.006369
## Detection Rate 0.000000 0.00000 0.3248 0.2739 0.07643 0.006369
## Detection Prevalence 0.006369 0.01911 0.4586 0.3822 0.12102 0.012739
## Balanced Accuracy 0.496795 0.49013 0.7570 0.7508 0.79043 0.996795
rm(h2o.model, h2o.train, h2o.valid, h2o.predictions)

As I expected, xgboost gives the best outcome among all the models in this kernel. A better hyper-parameter-tuned xgboost model or lightgbm could potentially produce a better result. Additionally, an ensemble model might also give an improved result.
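As a sketch of the ensemble idea, a simple majority vote over three of the fitted models could look like the following. It assumes rf_model, svm_model, xgb_model, and data.valid are still in memory (i.e. the rm() cleanup calls above are skipped):

```r
# Illustrative majority-vote ensemble (assumes the fitted models were kept)
rf_votes  <- predict(rf_model,  newdata = valid)
svm_votes <- predict(svm_model, newdata = valid)
xgb_votes <- as.factor(predict(xgb_model, data.valid) + 2)  # map back to 3-8
votes <- data.frame(rf_votes, svm_votes, xgb_votes)
# Row-wise mode: the most common predicted quality wins
majority <- factor(apply(votes, 1, function(v) names(which.max(table(v)))),
                   levels = levels(valid$quality))
confusionMatrix(majority, valid$quality)
```

A hard vote like this mainly helps when the models make uncorrelated errors, which is part of why svm was included as a less correlated alternative.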
If you have any questions, please leave a comment, and if you like the kernel, please give an upvote~ Thanks!